Conversation
import torch.distributed as dist
world_size = test.args["world_size"]
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "12356"
You might want to set up a large random port, in case a job fails and the port doesn't get released.
This is a really cool PR; I don't think I've seen anyone recently manage process pools manually as opposed to just calling torchrun. I think this solution works quite nicely for us, though, because we don't care (yet) about multi-node or fault tolerance.
Idk if he has time, but just in case: @kiukchung, we would really appreciate your feedback on whether this is a reasonable way of supporting a job system where people submit distributed kernels.
The part I'm a bit worried about: the example right now can check that a kernel is globally correct by iterating over many ranks and making sure they're locally correct. That's probably not going to be true for any of the distributed AMD kernels. This is fine as long as we offload the burden of gathering all the outputs onto the user.
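For kernels whose correctness can only be judged globally, the validation function itself can run collectives to assemble a global view. A minimal sketch with the gloo backend (world_size=1 only so it stays single-process and runnable; the port and tensor values are made up):

```python
import os
import torch
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29531"  # arbitrary port for this sketch
# world_size=1 keeps the sketch in one process; in the real harness
# every rank would execute this same validation code.
dist.init_process_group("gloo", rank=0, world_size=1)

local_out = torch.arange(4.0)  # stand-in for this rank's kernel output
gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_out)  # afterwards every rank holds all outputs
full_output = torch.cat(gathered)     # global view, ready for a global check

dist.destroy_process_group()
```

With this pattern the per-rank validation function is still "local" from the harness's point of view, but it can make globally-informed pass/fail decisions.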
check_copy = _clone_data(data, rank)

# first, one obligatory correctness check
output = custom_kernel(_clone_data(data, rank))
I'd like to confirm: should users' custom_kernel be designed as an API that accepts and outputs single-rank data? My ref_kernel accepts all-rank data and outputs all-rank results... is that a conflict, or does something need to be changed in my PR? https://github.com/gpu-mode/reference-kernels/pull/51/files#diff-4634bd7a4a47ab89859ee0db3f4f3f3c8123cf18981fb4d24b4655a412777013R240
I think, essentially, the user kernel should be what is called _worker in your code. I wanted to avoid passing tensor objects across process boundaries with the multiprocessing module; I see that in your example you also transfer the tensor to the CPU before returning from the worker.
Yes, _worker is actually the kernel that each rank runs. In your case, do you expect people to submit a custom_kernel that contains only the worker? Does that work with your infra? In other words, does the submission.py (https://github.com/gpu-mode/reference-kernels/pull/51/files#diff-e7f1884ce42d6b30a70551a1a922b2547b8cb7b91e959a2ceb6f4471740dde95) work with your infra?
Big thanks, I saw siro's commit to my PR, which is compatible with this PR.

If you cannot do the check locally, there's nothing preventing us from having all-gather calls inside the validation function.
Managing your own processes is perfectly valid. The use-case here seems to be data-parallel eval on a single host (but multi-GPU).
Implement multi-GPU tasks.
design:
run_eval still calls only a single instance of eval.py, but if the task is multi-GPU, eval.py creates a pool of multiple worker processes and calls the user op on each with a different rank argument.
In the current code, input generation and testing are also handled by each process independently; I'm not sure whether that's the right approach.
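The design described above could be sketched roughly like this (a plain-multiprocessing stand-in, not the actual eval.py; all names and the toy kernels are assumptions): each process generates its own input, runs the user op with its rank, and checks it locally.

```python
import multiprocessing as mp

def generate_input(rank: int) -> list:
    # Each process generates its own input independently (current design).
    return list(range(4))

def custom_kernel(data: list, rank: int) -> list:
    # Stand-in for the user-submitted per-rank op.
    return [x * (rank + 1) for x in data]

def ref_kernel(data: list, rank: int) -> list:
    # Stand-in for the reference implementation.
    return [x * (rank + 1) for x in data]

def _eval_rank(rank: int, queue) -> None:
    data = generate_input(rank)
    # One obligatory correctness check, done locally per rank.
    queue.put((rank, custom_kernel(data, rank) == ref_kernel(data, rank)))

def eval_task(world_size: int = 2) -> bool:
    ctx = mp.get_context("fork")  # fork avoids re-importing the caller module
    queue = ctx.Queue()
    procs = [ctx.Process(target=_eval_rank, args=(r, queue)) for r in range(world_size)]
    for p in procs:
        p.start()
    verdicts = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return all(verdicts.values())
```

A single eval.py invocation owns the pool, so run_eval never needs to know the task is multi-GPU; the open question from the comment above is whether input generation and testing should stay inside each worker or move to the parent.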